Knowledge and pre-trained language models inside and out: a deep-dive into datasets and external knowledge
Pre-trained Language Models (PLMs) have greatly advanced the performance of various NLP tasks and have undoubtedly been serving as foundation models for this field. These pre-trained models are able to capture rich semantic patterns from large-scale text corpora and learn high-quality representations of texts. However, such models still have shortcomings: they underperform on tasks that require implicit external knowledge, which is difficult to learn with commonly employed pre-training objectives. Moreover, a comprehensive understanding of PLMs’ behaviour in learning knowledge during the fine-tuning phase is still lacking. To address these challenges, we propose a set of approaches to inject external knowledge into PLMs and present experiments investigating their behaviour in learning knowledge during the fine-tuning phase, primarily focusing on Sentiment Analysis, Question Answering and Video Question Answering.
Specifically, we introduce novel approaches explicitly using textual historical reviews of users and products for improving sentiment analysis. To overcome the problem of context-question lexical overlap and data scarcity for question generation, we propose a novel method making use of linguistic and semantic knowledge with heuristics. Additionally, we explore how to utilise multimodal (visual and acoustic) information/knowledge to improve Video Question Answering.
Experiments conducted on benchmark datasets show that our proposed approaches achieve superior performance compared to state-of-the-art models, demonstrating the effectiveness of our methods for injecting external knowledge. Furthermore, we conduct a set of experiments investigating how PLMs learn knowledge for question answering under various scenarios. Results reveal that the internal characteristics of QA datasets can introduce a strong bias when PLMs learn from downstream task datasets. Finally, we present an in-depth discussion of future directions for improving PLMs with external knowledge.
Extending the Scope of Out-of-Domain: Examining QA models in multiple subdomains
Past works that investigate out-of-domain performance of QA systems have
mainly focused on general domains (e.g. news domain, wikipedia domain),
underestimating the importance of subdomains defined by the internal
characteristics of QA datasets. In this paper, we extend the scope of
"out-of-domain" by splitting QA examples into different subdomains according to
several of their internal characteristics, including question type, text
length, and answer position. We then examine the performance of QA systems
trained on the data from different subdomains. Experimental results show that
the performance of QA systems can be significantly reduced when the training
data and test data
come from different subdomains. These results question the generalizability of
current QA systems in multiple subdomains, suggesting the need to combat the
bias introduced by the internal characteristics of QA datasets.
Comment: 14 pages, 6 figures, 29 tables, to appear at ACL 2022 Workshop on
Insights from Negative Results in NLP, code available in
https://github.com/lyuchenyang/Analysing-Question-Answering-Dat
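The subdomain split described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the bucketing heuristics (first wh-word for question type, first or second half of the context for answer position), the function names, and the toy examples are all assumptions introduced here.

```python
from collections import defaultdict

# Hypothetical heuristic: the question type is the leading wh-word, if any.
WH_WORDS = ("what", "who", "when", "where", "why", "how", "which")


def question_type(example):
    """Bucket an example by the first word of its question."""
    words = example["question"].lower().split()
    return words[0] if words and words[0] in WH_WORDS else "other"


def answer_position(example):
    """Bucket an example by where the answer starts in the context."""
    start = example["context"].index(example["answer"])
    relative = start / len(example["context"])
    return "first_half" if relative < 0.5 else "second_half"


def split_by_subdomain(examples, key_fn):
    """Group examples into subdomains using the given bucketing function."""
    buckets = defaultdict(list)
    for example in examples:
        buckets[key_fn(example)].append(example)
    return dict(buckets)


# Toy data, invented for illustration only.
CONTEXT = "Shakespeare wrote Hamlet around 1600."
examples = [
    {"question": "Who wrote Hamlet?", "context": CONTEXT, "answer": "Shakespeare"},
    {"question": "When was Hamlet written?", "context": CONTEXT, "answer": "1600"},
    {"question": "Name the author of Hamlet.", "context": CONTEXT, "answer": "Shakespeare"},
]

by_type = split_by_subdomain(examples, question_type)
by_position = split_by_subdomain(examples, answer_position)
print({name: len(items) for name, items in by_type.items()})
print({name: len(items) for name, items in by_position.items()})
```

Training on one bucket and testing on another would then approximate the cross-subdomain setting the abstract describes.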
New Trends in Machine Translation using Large Language Models: Case Examples with ChatGPT
Machine Translation (MT) has made significant progress in recent years using
deep learning, especially after the emergence of large language models (LLMs)
such as GPT-3 and ChatGPT. This brings new challenges and opportunities for MT
using LLMs. In this paper, we brainstorm some interesting directions for MT
using LLMs, including stylized MT, interactive MT, and Translation Memory-based
MT, as well as a new evaluation paradigm using LLMs. We also discuss the
privacy concerns in MT using LLMs and a basic privacy-preserving method to
mitigate such risks. To illustrate their potential, we present several
examples for the new directions mentioned above, demonstrating their
feasibility and highlighting the opportunities and challenges for future
research in MT using LLMs.
QAScore -- An Unsupervised Unreferenced Metric for the Question Generation Evaluation
Question Generation (QG) aims to automate the task of composing questions for
a passage with a set of chosen answers found within the passage. In recent
years, the introduction of neural generation models has resulted in substantial
improvements of automatically generated questions in terms of quality,
especially compared to traditional approaches that employ manually crafted
heuristics. However, the metrics commonly applied in QG evaluations have been
criticized for their low agreement with human judgement. We therefore propose a
new reference-free evaluation metric that has the potential to provide a better
mechanism for evaluating QG systems, called QAScore. Instead of fine-tuning a
language model to maximize its correlation with human judgements, QAScore
evaluates a question by computing the cross entropy according to the
probability that the language model can correctly generate the masked words in
the answer to that question. Furthermore, we conduct a new crowd-sourcing human
evaluation experiment for the QG evaluation to investigate how QAScore and
other metrics can correlate with human judgements. Experiments show that
QAScore obtains a stronger correlation with the results of our proposed human
evaluation method compared to existing traditional word-overlap-based metrics
such as BLEU and ROUGE, as well as the existing pretrained-model-based metric
BERTScore.
Comment: 19 pages, 5 figures, 7 tables
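The masked-answer scoring idea in this abstract can be sketched roughly as below. This is a hedged illustration only: the abstract does not give the exact formula, so the aggregation (a sum of log-probabilities, where higher means better), the `[MASK]` token, and the `token_prob` callable are assumptions. The toy probability function stands in for a real pre-trained language model.

```python
import math


def qascore_sketch(passage, question, answer_tokens, token_prob):
    """Sum the log-probabilities of each answer token when it is masked.

    token_prob is a hypothetical callable that returns the probability a
    language model assigns to the true token at the masked position, given
    the passage and the question under evaluation.
    """
    log_prob_sum = 0.0
    for i, token in enumerate(answer_tokens):
        # Mask one answer token at a time, keeping the others visible.
        masked = answer_tokens[:i] + ["[MASK]"] + answer_tokens[i + 1:]
        p = token_prob(passage, question, masked, i, token)
        log_prob_sum += math.log(max(p, 1e-12))  # guard against log(0)
    return log_prob_sum


def toy_token_prob(passage, question, masked_answer, position, token):
    """Stand-in for a language model: relevant questions make the answer likelier."""
    return 0.8 if "capital" in question.lower() else 0.2


passage = "Paris is the capital of France."
answer = ["Paris"]
good = qascore_sketch(passage, "What is the capital of France?", answer, toy_token_prob)
bad = qascore_sketch(passage, "How many people live there?", answer, toy_token_prob)
print(good > bad)
```

No reference question is needed at any point, which is what makes a metric of this kind reference-free.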
Improving document-level sentiment analysis with user and product context
Past work that improves document-level sentiment analysis by encoding user and product information has been limited to considering only the text of the current review. We investigate incorporating additional review text available at the time of sentiment prediction that may prove meaningful for guiding prediction. Firstly, we incorporate all available historical review text belonging to the author of the review in question. Secondly, we investigate the inclusion of historical reviews associated with the current product (written by other users). We achieve this by explicitly storing representations of reviews written by the same user and about the same product, forcing the model to memorize all reviews for one particular user and product. Additionally, we drop the hierarchical architecture used in previous work to enable words in the text to directly attend to each other. Experimental results on the IMDB, Yelp 2013 and Yelp 2014 datasets show an improvement over the state of the art of more than 2 percentage points in the best case.
Is a video worth n × n Images? A highly efficient approach to transformer-based video question answering
Conventional Transformer-based Video Question Answering (VideoQA) approaches generally encode frames independently through one or more image encoders, followed by interaction between frames and question. However, such schemes incur significant memory use and inevitably slow down training and inference. In this work, we present a highly efficient approach for VideoQA based on existing vision-language pre-trained models, where we concatenate video frames into an n × n matrix and then convert it to one image. By doing so, we reduce the number of image encoder calls from n² to 1 while maintaining the temporal structure of the original video. Experimental results on MSRVTT and TrafficQA show that our proposed approach achieves state-of-the-art performance with nearly 4× faster speed and only 30% of the memory use. We show that by integrating our approach into VideoQA systems we can achieve comparable, even superior, performance with a significant speed-up for training and inference. We believe the proposed approach can facilitate VideoQA-related research by reducing the computational requirements for those who have limited access to budgets and resources. Our code is publicly available at https://github.com/lyuchenyang/Efficient-VideoQA for research use.
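The frame-concatenation step can be illustrated with a small sketch. It assumes a row-major layout (earlier frames first, left to right, then top to bottom) and represents each frame as a nested list of pixel values. Both choices are assumptions made for illustration; the authors' implementation presumably works on image tensors.

```python
def frames_to_grid(frames, n):
    """Tile n * n equally sized frames into one image, in row-major order.

    Each frame is a list of pixel rows. Keeping the frames in temporal
    order within the grid preserves the order of the original video.
    """
    if len(frames) != n * n:
        raise ValueError(f"expected {n * n} frames, got {len(frames)}")
    grid = []
    for r in range(n):
        row_frames = frames[r * n:(r + 1) * n]
        height = len(row_frames[0])
        for y in range(height):
            # Join pixel row y of every frame in this grid row, side by side.
            grid.append([px for frame in row_frames for px in frame[y]])
    return grid


# Four toy 2x2 frames, each filled with its own temporal index.
frames = [[[t, t], [t, t]] for t in range(4)]
image = frames_to_grid(frames, n=2)
for row in image:
    print(row)
```

The resulting 4 × 4 image would be passed through an image encoder once, instead of encoding each of the four frames separately.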
Dialogue-to-Video Retrieval
Recent years have witnessed an increasing amount of dialogue/conversation on
the web, especially on social media. This inspires the development of
dialogue-based retrieval, in which retrieving videos based on dialogue is of
increasing interest for recommendation systems. Different from other video
retrieval tasks, dialogue-to-video retrieval uses structured queries in the
form of user-generated dialogue as the search descriptor. We present a novel
dialogue-to-video retrieval system, incorporating structured conversational
information. Experiments conducted on the AVSD dataset show that our proposed
approach using plain-text queries improves over the previous counterpart model
by 15.8% on R@1. Furthermore, our approach using dialogue as a query improves
retrieval performance by 4.2%, 6.2% and 8.6% on R@1, R@5 and R@10, and
outperforms the state-of-the-art model by 0.7%, 3.6% and 6.0% on R@1, R@5 and
R@10, respectively.
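For readers unfamiliar with the R@k figures reported above, the following sketch shows how Recall@k is commonly computed for retrieval with one relevant video per query. The function name, the toy rankings, and the video identifiers are invented for illustration and are not taken from the paper.

```python
def recall_at_k(ranked_lists, gold_ids, k):
    """Fraction of queries whose gold video appears in the top-k results."""
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_ids) if gold in ranked[:k]
    )
    return hits / len(gold_ids)


# Toy rankings for three dialogue queries (best match first).
ranked_lists = [
    ["v1", "v2", "v3"],
    ["v2", "v3", "v1"],
    ["v3", "v1", "v2"],
]
gold_ids = ["v1", "v3", "v2"]

for k in (1, 2, 3):
    print(k, recall_at_k(ranked_lists, gold_ids, k))
```

Recall@k can only stay the same or grow as k increases, which is why R@10 is at least as high as R@1 for any given system.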
Document-Level Machine Translation with Large Language Models
Large language models (LLMs) such as ChatGPT can produce coherent, cohesive,
relevant, and fluent answers for various natural language processing (NLP)
tasks. Taking document-level machine translation (MT) as a testbed, this paper
provides an in-depth evaluation of LLMs' ability on discourse modeling. The
study focuses on three aspects: 1) Effects of Discourse-Aware Prompts, where
we investigate the impact of different prompts on document-level translation
quality and discourse phenomena; 2) Comparison of Translation Models, where we
compare the translation performance of ChatGPT with commercial MT systems and
advanced document-level MT methods; 3) Analysis of Discourse Modelling
Abilities, where we further probe discourse knowledge encoded in LLMs and
examine the impact of training techniques on discourse modeling. By evaluating
a number of benchmarks, we surprisingly find that 1) leveraging its powerful
long-text modeling capabilities, ChatGPT outperforms commercial MT systems in
terms of human evaluation; 2) GPT-4 demonstrates a strong ability to explain
discourse knowledge, even though it may select incorrect translation
candidates in contrastive testing; 3) ChatGPT and GPT-4 have demonstrated
superior performance and show potential to become a new and promising paradigm
for document-level translation. This work highlights the challenges and
opportunities of discourse modeling for LLMs, which we hope can inspire the
future design and evaluation of LLMs.
Semantic-aware dynamic retrospective-prospective reasoning for event-level video question answering
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to obtain the visual information needed to provide optimal answers. However, despite significant progress in model performance, few studies have focused on using the explicit semantic connections between the question and visual information, especially at the event level. There is a need to use such semantic connections to facilitate complex reasoning across video frames. Therefore, we propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering. Specifically, we explicitly use the Semantic Role Labeling (SRL) structure of the question in the dynamic reasoning process, where we decide whether to move to the next frame based on which part of the SRL structure (agent, verb, patient, etc.) of the question is being focused on. We conduct experiments on a benchmark EVQA dataset, TrafficQA. Results show that our proposed approach achieves superior performance compared to previous state-of-the-art models. Our code is publicly available at https://github.com/lyuchenyang/Semantic-aware-VideoQA